Code to clean the data file-by-file

Importing the necessary libraries

In [1]:
import pandas as pd
import csv
import string
import re
import nltk

nltk.download('stopwords')
nltk.download('names')
from nltk.corpus import stopwords
from nltk.corpus import names
from nltk import word_tokenize
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Aruna\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Aruna\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!
In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

%matplotlib inline
pd.set_option('display.max_colwidth', 150)

(A) Read the CSV File

In [3]:
df = pd.read_csv("C:\\Users\\Aruna\\Documents\\input\\Amazon RDS.csv")

df['description'] = df['description'].apply(lambda x: " ".join(x for x in str(x).split())) # converting to string
 
df.head(10)
Out[3]:
id label description
0 10472.0 Amazon RDS DBName must be null for engine: sqlserver-web??? Huh? I am trying to use the AWS web interface to create a simple RDS SQLServer instance and I get...
1 10472.0 Amazon RDS I'm getting the same error.... just something as simple as creating a new RDS Sql Server Express database... and it errors out with this message.....
2 10472.0 Amazon RDS Me too on this. Thanks to @smcleod for the tip, I have now created my database!
3 10472.0 Amazon RDS Im having the same issue. There are two "display:none" field found in my F12 developer chrome. But neither of them has the word "css". Edited by: ...
4 10472.0 Amazon RDS This should now be fixed. Please do let me know if you run into the same issue.
5 10472.0 Amazon RDS This does not always occur, but when this error comes up, you can go to the previous page for selecting the "DB engine version" and from the dropd...
6 10472.0 Amazon RDS Please note that this issue stil exists and thanks to the comments earlier I was able to unhide the DB name(where the same name as of DB instance ...
7 10471.0 Amazon RDS Aurora Postgresql: PG::InternalError: ERROR: could not open file "base/164 Hello all, Since about two days ago we've started seeing the following ...
8 10470.0 Amazon RDS RDS Monitoring FreeStorageSpace doesn't match data in Metrics I have a 100Gb database and when viewing the free storage space by hitting the monit...
9 10469.0 Amazon RDS PostgreSQL RDS verbose logging Hi all, How do I turn on verbose logging in PostgreSQL RDS instance? I would like to have logs where I can see all ...
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39107 entries, 0 to 39106
Data columns (total 3 columns):
id             39106 non-null float64
label          39107 non-null object
description    39107 non-null object
dtypes: float64(1), object(2)
memory usage: 916.6+ KB

Check out one sample post:

In [5]:
p = 2000

df['description'][p]
Out[5]:
'You might want to ask in the Aurora PostgreSQL forum at https://forums.aws.amazon.com/forum.jspa?forumID=227 While you are unlikely to get a date, its always nice to let the team know why you need the new minor as soon as they can make it available.'

Top 30 words + frequency of each:

In [6]:
pd.Series(' '.join(df['description']).split()).value_counts()[:30]
Out[6]:
the         118605
to           94173
I            53218
a            51368
and          43365
is           42125
in           33700
RDS          31388
of           31165
for          30355
on           27699
that         26280
you          25642
instance     24334
this         24108
it           22245
have         21034
with         18279
not          17420
be           16881
from         15229
are          15137
can          14106
your         13350
an           13191
but          12326
at           12078
my           11837
as           11392
database     11282
dtype: int64
In [7]:
print("There are totally", df['description'].apply(lambda x: len(x.split(' '))).sum(), "words before cleaning.")
There are totally 2983466 words before cleaning.

(B) Text Pre-processing

In [8]:
STOPWORDS = stopwords.words('english')
my_stop_words = ["hi", "hello", "regards", "thank", "thanks", "regard", "best", "wishes", "hey", "amazon", "aws", "s3",
"elastic", "beanstalk", "rds", "ec2", "lambda", "cloudfront", "cloud", "front", "vpc", "sns", "me",
"january", "february", "march", "april", "may", "june", "july", "august", "september", "october", 
"november", "december", "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "sept", "oct", "nov",
"dec", "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday", "mon", "tue",
"wed", "thu", "fri", "sat", "sun", "ain't", "aren't", "can't", "can't've", "'cause", "could've", "couldn't",
"couldn't've", "didn't", "doesn't", "don't", "hadn't", "hadn't've", "hasn't", "haven't", "he'd", "he'd've",
"he'll", "he'll've", "he's", "how'd", "how'd'y", "how'll", "how's", "i'd", "i'd've", "i'll", "i'll've", "i'm",
"i've", "isn't", "it'd", "it'd've", "it'll", "it'll've", "it's", "let's", "mayn't", "might've", "mightn't",
"mightn't've", "must've", "mustn't", "mustn't've", "needn't", "needn't've", "oughtn't", "oughtn't've", "shan't",
"sha'n't", "shan't've", "she'd", "she'd've", "she'll", "she'll've", "she's", "should've", "shouldn't", "shouldn't've",
"so've", "so's", "that'd", "that'd've", "that's", "there'd", "there'd've", "there's", "they'd", "they'd've", "they'll",
"they'll've", "they're", "they've", "to've", "wasn't", "we'd", "we'd've", "we'll", "we'll've", "we're", "we've",
"weren't", "what'll", "what'll've", "what're", "what's", "what've", "when's", "when've", "where'd", "where's",
"where've", "who'll", "who'll've", "who's", "who've", "why's", "why've", "will've", "won't", "won't've", "would've",
"wouldn't", "wouldn't've", "yall", "yalld", "yalldve", "yallre", "yallve", "youd", "youdve", "youll",
"youllve", "youre", "youve", "do", "did", "does", "had", "have", "has", "could", "can", "as", "is",
"shall", "should", "would", "will", "you", "me", "please", "know", "who", "we", "was", "were", "edited", "by", "pm"]

name = names.words()
STOPWORDS.extend(my_stop_words)
STOPWORDS.extend(name)

REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,:;#+?]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z - _.]+')
REMOVE_HTML_RE = re.compile(r'<.*?>')
REMOVE_HTTP_RE = re.compile(r'http\S+')

STOPWORDS = [BAD_SYMBOLS_RE.sub('', x) for x in STOPWORDS]

Convert to lowercase

In [9]:
df['description'] = df['description'].apply(lambda x: " ".join(x.lower() for x in str(x).split(" ")))

df['description'][p]
Out[9]:
'you might want to ask in the aurora postgresql forum at https://forums.aws.amazon.com/forum.jspa?forumid=227 while you are unlikely to get a date, its always nice to let the team know why you need the new minor as soon as they can make it available.'

Remove all HTML tags

In [10]:
df['description'] = df['description'].apply(lambda x: " ".join(REMOVE_HTML_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[10]:
'you might want to ask in the aurora postgresql forum at https://forums.aws.amazon.com/forum.jspa?forumid=227 while you are unlikely to get a date, its always nice to let the team know why you need the new minor as soon as they can make it available.'
In [11]:
df['description'] = df['description'].apply(lambda x: " ".join(REMOVE_HTTP_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[11]:
'you might want to ask in the aurora postgresql forum at   while you are unlikely to get a date, its always nice to let the team know why you need the new minor as soon as they can make it available.'

Replace certain characters by space (quotation marks, parantheses etc)

In [12]:
df['description'] = df['description'].apply(lambda x: " ".join(REPLACE_BY_SPACE_RE.sub(' ', x) for x in str(x).split()))

df['description'][p]
Out[12]:
'you might want to ask in the aurora postgresql forum at while you are unlikely to get a date  its always nice to let the team know why you need the new minor as soon as they can make it available.'

Remove any unwanted symbols (like $, @ etc)

In [13]:
df['description'] = df['description'].apply(lambda x: " ".join(BAD_SYMBOLS_RE.sub('', x) for x in str(x).split()))

df['description'][p]
Out[13]:
'you might want to ask in the aurora postgresql forum at while you are unlikely to get a date its always nice to let the team know why you need the new minor as soon as they can make it available.'

Remove trailing punctuation marks and any symbol patterns

In [14]:
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('.') for x in x.split()))
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('-') for x in x.split()))
df['description'] = df['description'].apply(lambda x: " ".join(x.strip('_') for x in x.split()))
df['description'][p]
Out[14]:
'you might want to ask in the aurora postgresql forum at while you are unlikely to get a date its always nice to let the team know why you need the new minor as soon as they can make it available'

Remove any numbers

In [15]:
df['description'] = df['description'].apply(lambda x: " ".join(x for x in x.split() if not x.isdigit()))

df['description'][p]
Out[15]:
'you might want to ask in the aurora postgresql forum at while you are unlikely to get a date its always nice to let the team know why you need the new minor as soon as they can make it available'

Remove the stop words

In [16]:
df['description'] = df['description'].apply(lambda x: " ".join(x for x in x.split() if x not in STOPWORDS
                                                               and len(x) > 1))

df['description'][p]
Out[16]:
'might want ask aurora postgresql forum unlikely get date always nice let team need new minor soon make available'

Results after cleaning data:

In [17]:
df.head()
Out[17]:
id label description
0 10472.0 Amazon RDS dbname must null engine sqlserverweb huh trying use web interface create simple sqlserver instance get error dbname must null engine sqlserverweb ...
1 10472.0 Amazon RDS getting error something simple creating new sql server express database errors message support needs address dbname must null engine sqlserverex s...
2 10472.0 Amazon RDS smcleod tip created database
3 10472.0 Amazon RDS issue two display none field found f12 developer chrome neither word css nib01
4 10472.0 Amazon RDS fixed let run issue

Top 30 words + frequency of each:

In [18]:
pd.Series(' '.join(df['description']).split()).value_counts()[:30]
Out[18]:
instance     35374
mysql        15628
database     15436
db           13057
error        10082
server        9558
using         8473
issue         7456
time          7196
instances     6682
use           6620
aurora        6457
group         6447
sql           6120
see           6096
one           6072
connect       6031
data          5997
set           5954
new           5780
like          5426
create        5365
get           5305
need          5193
read          5127
also          5032
problem       4965
security      4913
running       4856
help          4535
dtype: int64
In [19]:
print("There are totally", df['description'].apply(lambda x: len(x.split(' '))).sum(), "words after cleaning.")
There are totally 1449226 words after cleaning.

(C) Write to CleanText.csv

In [20]:
with open('C:\\Users\\Aruna\\Documents\\ACMS-IID\\input\\CleanText.csv', 'a', encoding='utf-8', newline='') as csvfile:
    writer = csv.writer(csvfile)
    # writer.writerow(['id', 'label', 'description'])
    for i in range(0, len(df['description'])):
        if len(df['description'][i]) > 1:
            writer.writerow([df['id'][i], df['label'][i], df['description'][i]])

(D) Generate the word cloud

In [21]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 20, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[21]:
(-0.5, 399.5, 199.5, -0.5)
In [22]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 50, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[22]:
(-0.5, 399.5, 199.5, -0.5)
In [23]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 100, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[23]:
(-0.5, 399.5, 199.5, -0.5)
In [24]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 500, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[24]:
(-0.5, 399.5, 199.5, -0.5)
In [25]:
msgs = " ".join(str(msg) for msg in df['description'])
fig, ax = plt.subplots(1, 1, figsize  = (100,100))
wordcloud = WordCloud(max_font_size = 20, max_words = 1000, background_color = "white").generate(msgs)
ax.imshow(wordcloud, interpolation='bilinear')
ax.axis('off')
Out[25]:
(-0.5, 399.5, 199.5, -0.5)
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: